Flood susceptibility mapping utilizing the integration of geospatial and multivariate statistical analysis, Erbil area in Northern Iraq as a case study

Climate extreme events such as floods and droughts in any area have a significant impact on human life, infrastructure, agriculture, and the economy. In the last two years, flash floods caused by heavy rainstorms have become frequent and destructive in many catchments in Northern Iraq. The present study aims to examine flash floods in the Erbil region, Northern Iraq using Remote sensing (RS), Geographic Information System (GIS), and Principal Component Analysis (PCA) for geomorphic data. PCA results revealed that 12 geomorphic parameters exhibited a significant correlation with two different statistical components. To facilitate practical application, ranks are assigned based on the calculated parameters for flood susceptibility mapping. Out of the 24 basins in the current study, three basins (16, 3, and 14) have the highest geomorphometric values (36–39), indicating the zone most susceptible to flash floods and making up a maximum area of 38.58% of the studied region. Six basins (4, 8, 9, 10, 12, and 15), which have geomorphometric values between 30 and 35 and cover a land area of 27.86%, are the most moderately vulnerable to floods. The remaining basins, which make up 33.47% of the research, are occasionally subject to floods and have geomorphometric scores below 30. The precision of the flood susceptibility mapping was validated using the bifurcation ratio and drainage density relationship as well as past flood damages, such as economic losses and human casualties. Most of the recorded injuries and fatalities took place in areas that were particularly prone to severe past flooding. Additionally, the investigation revealed that 44.56% of all populated areas are located in extremely vulnerable basins. The findings demonstrate a notable correlation between the identified flood-susceptible areas and the occurrence of past flood damage.

Data analysis. In this study, different data sources and methodologies are applied for identifying the most vulnerable flood areas (Fig. 2). The digital elevation model (DEM) from the Shuttle Radar Topographic Mission (SRTM) was used to extract streams and drainage basins from the study area. The SRTM is available for download from NASA's Earth Explorer website (https:// earth explo rer. usgs. gov/). It is a single-pass synthetic aperture radar interferometry (InSAR) campaign that provides unique DEM data with a resolution of 90 m. In contrast to the 10-m mesh DEM developed by the Geographical Survey Institute of Japan, the elevation error of the DEM ranges from + 7.4 m (forest areas) to 0.7 m (bare land), whereas the horizontal inaccuracy and shifting range from 0.13 arc-seconds for East/West to 0.19 arc-seconds for North/South. The high-resolution, open, accurate, comparable, and timely land use/land cover maps are derived from the ESRI 2020 Landcover database (https:// livin gatlas. arcgis. com/ landc over/). The database is a global product for a user-specified area of interest and time period (2018)(2019)(2020)(2021)(2022). The database was developed in conjunction with ESRI by Microsoft Planetary Computer and scaled using Microsoft Azure Batch using a deep learning model trained on 5 billion manually annotated Sentinel-2 pixels spread across 20,000 sample points worldwide. The final output comprises a composite of nine classes of LULC predictions over the required time period, with an overall accuracy of 86% worldwide, according to the validation points. In addition to the available geological map from The Iraq Geological Survey (http:// en. geosu rviraq. iq/), the soil map (1:50,000) was downloaded from the State Board for Land Reclamation, Minis- www.nature.com/scientificreports/ try of Agriculture, Iraq (http:// zeraa. gov. iq/). Soil classification has been developed using the FAO/UNEP international standard system based on the world soil map (FAO/ UNESCO, 1974). Annual rainfall data from 1981 to 2022 are available from NASA's Prediction of Worldwide Energy Resources (https:// power. larc. nasa. gov). The service offers precipitation data with a high resolution that was produced from the Integrated Multi-satellite Retrievals for the GPM component of NASA's Global Precipitation Measurement (GPM) program (IMERG). Global 0.1° × 0.1° latitude/longitude grids are used to resolve IMERG precipitation data. Rainfall data had a higher resolution than previous rainfall grid maps; thus, it was utilized to create a spatial rainfall distribution map. Additionally, annual rainfall maxima collected from the RainSphere database of CHRS at University of California, Irvine (https:// rains phere. eng. uci. edu/) for the period from 1983-2021 has been used for the frequency analysis. The Weibull distribution is commonly used for frequency analysis as well as risk and reliability analysis of the lifetimes of systems and their components. Its applications have been reported frequently in hydrology and meteorology. Weibull in 1939 introduced the equation for the analysis of food magnitude for the corresponding return periods. Plotting the distribution involves ranking the data. It is specifically used with the exponential distribution and the Rayleigh distribution. In addition to other forms of probability distributions, this technique's probability distribution is tied to the scale and shape parameters. The density function for this method is highly sensitive to changes in the value of the form parameter. Here, the rank of the food discharge for a particular distribution, the quantity of observations (N), and the frequency of occurrence (f), sometimes referred to as the probability percentage.
Moreover, historical data on the flood disaster was available through the OFDA/CRED International Disaster Database (EM-DAT), which provides crucial core information on the occurrence and consequences of 22,000 mass disasters that have occurred worldwide from 1900 to the present. The database is compiled from various sources, including non-governmental organizations, insurance providers, and academic institutions. EM-DAT is available to the general public at https:// www. emdat. be/. For this study, historical losses including human and economic damages recorded in EM-DAT across all flood events from 2006 through 2021 in the study area to validate the resultant final map.
The ArcGIS hydrology spatial analyst tool extracts streams and delineates watersheds in a sequential manner during subsequent processing, beginning with filling sinks to remove any errors from the raw DEM, identifying the flow direction from higher to lower, assigning direction codes, and delineating the flow accumulation to determine the higher possibility flow accumulation pixels. The drainage network was built through a trialand-error method, taking into account pixels with flow accumulation greater than a threshold of 900 m 2 . The user provided pour points, which were used to build the stream network and identify the sites with the highest flow accumulation. Pour locations were determined assuming that the calculated drainage basins encompassed the entire river basins. A stream order was determined using a stream ordering method 35,36 . Strahler's ordering  www.nature.com/scientificreports/ scheme places streams without tributaries in the first order, streams with two first-order tributaries in the second order, and so on. A well-drained basin comprises fifth-order stream channels 35 .
To assess flood susceptibility, 24 morphometric variables (linear, areal, shape, and relief) were selected from the study area and computed employing an established formula (Table 1). Stream order (No), stream number (Nu), stream length (Lu), sinuosity ratio (Si), and length of overland flow (Lo) were calculated and ranked as linear drainage morphometric criteria. Drainage density (Dd), drainage frequency (Fs), stream texture (Dt), constant of channel maintenance (C), and infiltration ratio (If) were also extracted and included as areal drainage aspects. Shape parameters calculated include the elongation ratio (Re), circulatory ratio (Rc), compactness coefficient (Cc), form factor (Ff), and shape index (Sw), whereas relief parameters include the basin relief (Bh), relief ratio (Rr), and ruggedness number (Rn). PCA represents a multivariate statistical method commonly and efficiently used to reduce data size while preserving as many changes as possible in the dataset for faster data processing 37 . Many variables can be reduced to significant factors using PCA, and the final factors summarize the original data. XLSTAT software was used to perform PCA for 24 morphometric variables in 24 watersheds. Because the approach is heavily reliant on the sum variance of the original variables, it is critical to represent  www.nature.com/scientificreports/ the variables in a standard form, implying the same unit of measurement for each variable, to ensure a sample variance equal to 1. The correlation matrix is then examined with the total variance set to n. The inter-correlation matrix is obtained by standardizing the factors based on the following formula: where X denotes the matrix of standardized parameters; x ij denotes the ith observation on the jth parameters; i = 1…N (Number of observation); j = 1…P (Number of observations); x j denotes the jth parameters; and Sj denotes the standard deviation of the jth parameters. The correlation matrix of parameters is the minor product moment of the standardized predictor measures divided by N and is given as follows: where, x ′ denotes the transpose of the standardized matrix of predictor parameters. The principal component loading matrix, which indicates how much an individual parameter is correlated with different factors, is obtained by pre-multiplying the characteristics vector with the square root of the characteristic values of the correlation matrix. Thus, where A denotes the principal component loading matrix; Q denotes the characteristics vector of the correlation matrix; and D denotes the characteristics value of the correlation matrix. A rank (y) was determined for each basin using a relative ranking method based on the degree of flood susceptibility, where 1 indicates a very low susceptibility and 5 is a very high susceptibility in proportion to the value of each morphometric parameter (x). A linear interpolation method described by 38 was used to calculate ranks. When the morphometric variable and the flood event have a positive correlation, Eq. (4) is used. Otherwise, Eq. (5) is used.
The flood risk degree increases with the parameter value for group 1 (Eq. 4), whereas the flood risk degree decreases with the parameter value for group 2 (Eq. 5). Finally, the number of basins was determined by adding www.nature.com/scientificreports/ the standardized scores for each attribute. The calculated data were used to classify the basins into groups based on the risk and degree of flood vulnerability. Regardless of the method used for validation, it is essential to validate the flood susceptibility maps 39 . Morphometric parameters such as Rb, Dd, and Fs were usually compared to the proposed model in order to validate the use of the flood susceptibility map 16,[40][41][42] . In this regard, basin divided into three zones of the vulnerable flood basins based on the Rb in conjunction with Dd and Fs 41 . Zone "A" represents low flood susceptibility, zone "B" represents a high-flood risk, and zone "C" represents an intermediate flood risk. Furthermore, known flood areas from past years were used to validate the final flood susceptibility map. In the present study, the same morphometric parameters (Rb, Dd, and Fs) of the drainage basins are computed and graphically presented for comparing and validating the obtained results. For additional validation, the past flood locations from the International Disaster Database (EM-DAT) are compared to the identified vulnerable areas.

Results
Morphometric analysis. The peak discharge of the river, which causes flash floods, is significantly influenced by the geomorphometric features of the basins. For the 24 sub-basins (see supplementary, Table S1), the basin area (A) ranges from 4618 km 2 in the central regions (basin no. 14) to 107.61 km 2 in the northern regions (basin no. 20). In addition, basin perimeter (P) ranged from 57.71 km 2 for basin no. 21 to 338.49 km 2 for basin no. 14. The number of streams is relatively high for the central basins, such as 3, 4, 9, 14, and 16, indicating a significant discharge capability and flooding susceptibility. Among the 24 watersheds, 11 are assigned as fourthorder basins, 9 are fifth-order, 4 are sixth order, and 1 represents seventh order (see supplementary, Table S1). Further, the stream length characteristics of the different watersheds exhibited a noticeable variation, with basin no. 14 having the highest (2048 km) and basin no. 21 having the lowest (89 km) total length (Fig. 3). The overland flow (L O ) significantly impacts the physiographical, topographical, and hydrological development of these streams in the drainage basins. The sinuosity index (Si) ranges from 0.54 to 1.41. risk of erosion 43 .
Because Dd has a direct impact on the flood hydrograph, susceptibility modeling heavily relies on it as a geomorphometric factor. The results of this study showed a variation between the upper basins in the central (highest values) and southern areas (lowest values), owing to variations in the slope, soil infiltration, and rainfall throughout the study area (Fig. 3). The higher infiltration numbers in the central and southern basins indicate lower infiltration and higher drainage conditions. The reciprocal of drainage density was used 44 to define the morphological property of drainage basins regarding a constant of channel maintenance (C). Similarly, the Fs values of the studied basins range from 0.06 to 3.97 (see supplementary, Table S2). Such considerable values in the central areas imply moderate to high runoff and substantial peak discharge, posing flash flood risks compared to other basins within the study area (see supplementary, Figure S1). Furthermore, the Dt values vary from 0.38 www.nature.com/scientificreports/ to 8.95, with high values indicating very coarse texture as well as the presence of unstable slope materials, soft rocks, and an increased the calculated circularity ratios (Rc) for basins are between 0.19 (basin no. 9) and 0.62 (basin no. 13). Moreover, the elongation ratio (Re) of the basins ranges from 0.27 to 1.0, indicating that most watersheds are circular, with a few exceptions being elongated. Additionally, the shape factors for the basins range from 0.25 to 15.74, confirming their circular nature and making them more susceptible to flooding and experiencing greater erosion and sediment transport capacities. Furthermore, the compactness coefficient (Cc) and form factor (Ff) are both used to determine the relationship between the hydrological conditions and the circular shape of the drainage basin. All watersheds have a Cc of ≥ 1, indicating less elongation, high erosion, and high flood susceptibility 45 . Horton (1945) introduced Ff to forecast the flow intensity of a drainage basin manipulating the direct relationship between peak discharge and the inverse relationship between the squares of the axial length of the drainage basin. Ff values for the basins range from 0.06 to 3.7. Watersheds with high Ff values experience high peak flows with short duration. In contrast, a long watershed with a low form factor experiences modest peak flows that last longer. The period of concentration (Tc), a measure of the physical properties of a basin, is often low in the central and southern basins (see Supplementary, Figure S2), supporting the latter's high vulnerability to flooding. Basin relief (Bh) influences stream gradients, flooding patterns, and the number of transported sediments. Basins 1 and 2 have the lowest basin relief (50 m), whereas basin no. 14 has the highest (2770 m). In the study area, steep slopes dominated the northern and central basins, while gentle slopes dominated the southern basins (see supplementary, Table S2).

Selection of the causative factors. The correlation matrix of 24 geomorphic parameters indicates four
classes (see Supplementary, Table S3): First, excellent correlations were found between basin perimeter and basin area, total stream number and basin area, stream frequency and density, channel maintenance and stream frequency, infiltration number, drainage density and frequency, channel maintenance, Lo, and ruggedness number. Second, a significant correlation (correlation coefficient ˃0.75) was noted between total stream number and basin length, total stream number and total stream length, total stream length and perimeter of the basin, slope and form factor, ruggedness number and drainage density, stream frequency, as well as Lo and time of concentration. Third, a moderate correlation (correlation coefficient ˃0.60) existed between basin length and perimeter, total stream length and basin length, length of the main channel and total stream length, basin width and basin area, drainage texture and drainage density, length of overland flow, channel maintenance, relative relief ratio and compactness coefficient, as well as slope and elongation ratio. In addition, a weak correlation (correlation coefficient ˃0.7) existed between the time of concentration and total stream number, basin width and form factor, basin width and infiltration number, drainage texture and total stream number, and ruggedness number and basin width. At this stage, it is challenging to classify the parameters and determine the significance of the various aspects. Therefore, the PCA was used. The component loading matrix derived from the correlation matrix shows that the first four components, whose eigenvalues are greater than 1, account for about 85.3% of the total explained variance ( Table 2). The rotated component matrix indicates that components are significant. Component 1 (areal parameters) is highly correlated with density, frequency, length of overland flow, constant of maintenance, infiltration number, and raggedness number, whereas component 2 (F2) is strongly correlated with the basin area, perimeter, basin length, total stream length and number, drainage density, main channel length, and time of concentration ( Table 3).
As previously stated, the most crucial morphometric characteristics exhibit high correlation. As a result, only twelve of the twenty-four parameters used in this study are used to map flood susceptibility. Thus, the PCA was useful in removing the variables with the least relevance and grouping the remaining variables into physically significant components.
Flood susceptibility mapping. The degree of flash flood susceptibility is determined for 24 basins using the geomorphometric ranking approach (see Supplementary, Table S4). The approach has been applied by many researchers to evaluate the impacts of the morphometric characteristics of the drainage basin on flood www.nature.com/scientificreports/ susceptibility 14,46 . The method was used to calculate the cumulative ranking score for various geomorphometric features divided into two groups: positively correlated features (total stream number and length, basin area, drainage density, frequency and texture, and ruggedness number) and negatively correlated features (basin length, length of overland flow, time of concentration, stream maintenance, and infiltration number). The cumulative ranking score was used to divide the study area into susceptibility zones using geomorphometric data. The geomorphometric number is inversely proportional to the flash flood susceptibility of basins. According to the investigation, all basins have a geomorphometric number between 26 and 39. Out of the 24 basins in the present study, basins 16, 3, and 14 have the highest geomorphometric values (36)(37)(38)(39), signifying the zone most vulnerable to flash floods and making up a maximum land area of 38.58% of the study area. The moderately susceptible basins are basins 4, 8, 9, 10, 12, and 15, which cover a land area of 27.86% and have geomorphometric numbers between 30 and 35. The remaining basins comprise 33.47% of the study; they have limited exposure to floods and have geomorphometric numbers < 30 (Fig. 4). It is critical to validate flood susceptibility to illustrate how effectively the map performed. Rb is considered to represent the most important parameter for determining the degree of risk and has a greater impact than the other parameters 41 . Rb versus Dd and Fs relationships were displayed in Fig. 5. This method differs from PCA and ranking methods, which equalize the impacts of all parameters and are screened to identify the most significant factors. This difference can be attributed to the various weights that are assigned to the various parameters.

Discussions
In arid regions, flash floods are among the most devastating natural disasters, responsible for loss of life and serious damage to vital infrastructure and buildings 47 . Reliable and accurate data on flash flooding in arid environments are frequently lacking due to the remoteness and sparse habitation of such areas 21 . Delineation of flood susceptible areas is considered the basic action toward flood mitigation. In this study, the flood-prone areas are delineated through the integration of flood causative factors in a GIS environment, PCA, and ranking method. Results of the PCA indicate that the hydro-morphometric characteristics of the drainage basins can be combined into two main groups that account for 85% of the variation of the data and have clear physical meaning. The PC represents drainage density, stream frequency, length of overland flow, and constant maintenance as the most important predictors of flash flooding, which is consistent with what is commonly believed. Consequently, basins of the eastern areas are more susceptible to flooding when compared to other basins throughout the study area. Typically, basins in these areas are characterized by a high drainage density, high stream frequency, and high overland flow indicating high runoff potentiality and less infiltration 48,49 . In addition to geomorphic features, the duration of rainfall, soil erosion, increased population, urbanization and settlements, increased infrastructure, the efficiency of the sewer system, and public awareness, could contribute to flash floods 50 . In recent years, Climate www.nature.com/scientificreports/ change including variability in temperature, rainfall patterns, and storm activity is increasing 51 . There is rising evidence that the global climate is changing in response to natural and human factors 52 . Spatial distribution of the average annual rainfall indicates a variation from the western to the eastern parts of the study area (Fig. 6). Six heavy rainfall events were recorded between 2008 and 2018 53 .
Over the study area, intense rainfall has returned almost every two years, causing numerous flash floods with no warning. Results show that the total annual rainfall of the study area ranges from 244 to 633 mm, with an average of 436 mm (see Supplementary, Figure S3). The maximum annual daily rainfall varies from 12 to 45 mm with an average of 25 mm. In addition, the number of rainy days ranges between 111 to 234 mm, with an average of about 194 days/year. Based on the maximum daily rainfall data, statistical investigation is accomplished through the Weibull method and examined by probability distribution to fit the rainfall data to reach the best distribution curve, which matches the actual data of each rainfall station. From this graph, it is evident  www.nature.com/scientificreports/ that that in the following periods (5, 10, and a 100 years), an increase in the rainfall estimates is predicted (see Supplementary, Figure S3). Moreover, changes in land use/cover significantly affect the occurrence of surface water, infiltration, evapotranspiration, and soil erosion 54,55 . Evidently, the most common change that occurs in this area represents a transition from earlier natural to impervious surfaces. As cities expand due to population growth, fewer wetlands and more built-up regions exist. Infrastructure in populated regions is associated with an increase in impervious surface, which changes the hydrologic characteristics of the area (see supplementary, Figure S4). Therefore, soil infiltration decreases, surface runoff increases, and more sediment loads are transported into water bodies downstream, contributing to flooding. Thus, changes in land use and land cover cause soil compaction, reducing soil infiltration, and increasing runoff volume and speed 56 . The sewer system in the study area is old, disconnected, and insufficient to reduce the effectiveness of runoff during flood events. During the most intense rainfall events, outflow occurred due to sewer system failure. Finally, poor sewer system management and a lack of public awareness comprise factors that contribute to flash floods in this area. For proper flood management, clean and efficient inlets and sewer lines are essential.
A significant outcome of the present study is a regional flood susceptibility map of an arid and vulnerable climate area, which is one of the most flood-affected regions of the country. The value of the current work was expected to be limited by the large basin characteristics, which, in turn, could impact the expectation of the results to differ on a smaller scale. Interestingly, a comparison of the regional study results (here is the current study) with the results of local studies in the same area is beneficial. Therefore, this study shows that a large set of hydro-morphometric parameters can be reduced to a smaller set without loss of information, indicating redundancy in the data. The final map was certified for greater accuracy by comparing the flood hazard map with historical flood data from the International Disaster Database (EM-DAT), which highlights flood-prone locations (Table 4) (Table 4). Additionally, the investigation showed that 44.56% of all human settlements are in extremely sensitive basins (Choman, Mergasur, Soran, and Erbil). It is expected that the prepared map will help disaster managers,  www.nature.com/scientificreports/ planners, and local authorities to improve their understanding of flood vulnerability in the region. It can also be expected that the map will assist in guiding the operational responses of the various authorities in the region, especially in areas that have been identified as high-risk zones. The most effective strategy to lessen flooding is to implement a variety of carefully considered strategies that are suited to the specific catchment of interest 57 assert that no single mitigation approach has ever been able to effectively address the flood problem. Consequently, they claimed a better strategy that incorporates land use planning, engineering fixes, flood preparedness, and emergency management in the affected lowlands. This strategy would also consider the social and economic needs of communities in both the highland, which are frequently source areas and the lowland. While there is typically nothing that can be done to reduce the rainfall amount that a region receives, there are effective steps that can be taken to reduce the amount of surface runoff, which is a source of flooding. To mitigate the effects of the increased impervious surface caused by urban dwelling developments, rainwater harvesting is considered as an effective method for achieving this. The technique has been used in many developing countries with inadequate drainage systems as a means of mitigating some anthropogenically induced ecological problems [58][59][60] . Along with rainwater harvesting, there should be efficient land use/landcover management practices that could reduce the flood hydrograph peak which is usually of utmost concern. Reducing impermeable surfaces, unless essential, can also result in greater advantages. In order to effectively and sustainably reduce flooding, the use of materials and construction designs like permeable pavements for parking lots and sidewalks, green roofs, rain gardens, retention cisterns, and natural rehabilitation are recommended.

Conclusions
In this study, an effective mitigation technique for achieving early flood warning and effective flood crisis management by accurately mapping flood susceptibility in the Erbil area of northern Iraq was implemented through the integration of geospatial techniques, watershed characteristics, and PCA, flood risk zones were identified. The PCA analysis reduced the initial set of 24 morphometric parameters to two important statistical components, using the component loading matrix, which provided insights into the correlations between the morphometric variables and helped to identify the key factors influencing flood susceptibility. Based on the morphometric relative ranking, the study categorized the drainage basins into three classes: low, moderate, and high flood susceptibility. The analysis revealed that three basins have a considerable degree of flood susceptibility, six basins have a moderate degree, and the remaining basins have a low degree. The produced susceptibility mapping was validated by comparison with historical flood occurrences in the study area. The observed correlation between the identified flood-susceptible areas and past flood damages, such as economic losses and human casualties, further supports the accuracy of the flood susceptibility mapping approach. The integrated methodology of morphometric analysis, PCA, and GIS can be recommended as a valuable tool for efficient planning, management, and decision-making to detect flood hazards, create flood mitigation plans, and enhance the resilience of the region to future flood risks. In contrast to other methods routinely used for creating susceptibility maps, the present methodology requires considerably simpler input data, notably morphometric data, which can be readily retrieved from accessible DEMs with good resolution. The proposed method does not require other parameters, such as an inventory map, which is time-consuming and expensive to build. To fully curb the problem of flooding, it is highly recommended to include artificial intelligence in predicting the flood phenomenon in the region because of its importance in determining future plans and preventing flood dangers in the region.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.